---
id: qa-agent-running-evaluations
title: Running Evaluations on QA Agents
sidebar_label: Running Evaluations
---

To begin running evaluations on our QA Agent, we'll need two things:

1. An `EvaluationDataset` populated with the generated answers (`actual_output`) and the `retrieval_context` from the knowledge base for each question.
2. A selection of metrics to evaluate our `EvaluationDataset` on.

In the previous section, we defined `AnswerRelevancy` and `Faithfulness` as our metrics of choice. While we already have an `EvaluationDataset`, it has not yet been populated with these fields. Our first step is therefore to generate an actual output and retrieval context for each `Golden` before evaluation.
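
Since the `evaluate` call later in this section references the metric objects by name, here is a minimal sketch of what they might look like. The `threshold` values below are illustrative only; reuse whatever configuration you set up in the previous section:

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Illustrative configuration; reuse the metric objects from the previous section
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
faithfulness_metric = FaithfulnessMetric(threshold=0.7)
```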

:::info
You'll need to log in to Confident AI before you begin running evaluations. To do so, run the following command in your CLI:

```bash
deepeval login
```

:::

## Populating the Goldens

Before we can populate the Goldens, we need access to them. We'll pull the synthetic dataset we generated earlier from Confident AI, referencing it by its unique alias:

:::note
By default, `auto_convert_goldens_to_test_cases` is set to `True`, but we'll set it to `False` for this tutorial since `actual_output` is a required parameter of an `LLMTestCase`, and we haven't generated the actual outputs yet.
:::

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull("MadeUpCompany QA Dataset", auto_convert_goldens_to_test_cases=False)
```

Next, we'll iterate through the goldens and generate an `actual_output` and `retrieval_context` for each synthetic user query.

```python
from deepeval.test_case import LLMTestCase

for golden in dataset.goldens:
    # Compute actual output and retrieval context
    actual_output = qa_agent.generate(golden.input)  # Replace with logic to compute actual output
    retrieval_context = qa_agent.get_retrieval_context()  # Replace with logic to compute retrieval context

    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context
        )
    )
```
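
In the loop above, `qa_agent` stands in for your own QA pipeline. Purely as a sketch, and assuming a hypothetical retriever with a `search()` method and an LLM client with an `invoke()` method (neither is part of deepeval), such an agent could look roughly like this:

```python
class QAAgent:
    """Hypothetical RAG agent; swap in your own retriever and LLM client."""

    def __init__(self, retriever, llm, prompt_template: str, top_k: int = 3):
        self.retriever = retriever
        self.llm = llm
        self.prompt_template = prompt_template
        self.top_k = top_k
        self._last_retrieval_context: list[str] = []

    def generate(self, query: str) -> str:
        # Retrieve the top-k knowledge base chunks and keep them for later inspection
        self._last_retrieval_context = self.retriever.search(query, k=self.top_k)
        prompt = self.prompt_template.format(
            context="\n\n".join(self._last_retrieval_context),
            question=query,
        )
        return self.llm.invoke(prompt)

    def get_retrieval_context(self) -> list[str]:
        # Chunks used to answer the most recent query
        return self._last_retrieval_context
```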

## Running Evaluations

With our dataset ready, we can begin running evaluations. Pass the populated dataset along with the selected metrics to the `evaluate` function. Additionally, log the hyperparameters defined in the introduction section to benchmark them against other sets of hyperparameters in future iterations.

```python
from deepeval import evaluate

...
evaluate(
    dataset,
    metrics=[answer_relevancy_metric, faithfulness_metric],
    hyperparameters={"model": model, "prompt template": prompt_template, "top-k": top_k}
)
```
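
Here, `model`, `prompt_template`, and `top_k` refer to the hyperparameters defined in the introduction section. Purely as an illustration (the exact values depend on your own QA agent), they might look something like this:

```python
# Hypothetical hyperparameter values; substitute the ones your QA agent actually uses
model = "gpt-4o"
prompt_template = (
    "Answer the question using only the provided context:\n\n"
    "{context}\n\nQuestion: {question}"
)
top_k = 3
```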
